ABSTRACT

In this paper, an amended K-Means algorithm called K-Means++
is implemented for color quantization. K-Means++ improves on
the K-Means algorithm by replacing the random selection
of the initial centroids. The main advantage of K-Means++ is that the centroids
chosen are distributed over the data, which reduces the sum of squared
errors (SSE). The K-Means++ algorithm is used to analyze the color distribution
of an image and create the color palette for transforming it to a better quantized
image compared to the standard K-Means algorithm. The tests were
conducted on several popular true color images with different
K values: 32, 64, 128, and 256. The results show that the K-Means++
clustering algorithm yields higher PSNR values and lower file sizes compared
to the K-Means algorithm, by 2.58% and 1.05% respectively. It is envisaged that this clustering
algorithm will benefit many applications such as document clustering,
market segmentation, image compression, and image segmentation because
it produces accurate and stable results.

Keywords: Color quantization, Image compression, K-Means++, Machine learning, True color image

1. INTRODUCTION

The development of information and communication technologies drives an increasing need
for image and video data, which require large storage space and high transmission bandwidth.
Image processing technology has undergone significant growth. Therefore, to conserve existing storage
devices and transmission channels, methods that are able to compress data are widely investigated [1, 2].
A true-color image uses 24 bits per pixel: 8 bits for red, 8 for green, and 8 for blue [3].
Such an image can represent up to 16,777,216 colors, and the resulting volume of pixel data makes
its display, processing, transmission, and storage problematic. As a result, color quantization is used to address
many image processing and graphics problems. In the past, color quantization was needed due to
the limited graphics capabilities of display hardware. Color quantization still maintains its practical
value even though 24-bit display hardware has become common [4-6].
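
The 24-bit layout and the 16,777,216 figure can be illustrated with a few lines of Java. This is only a sketch: the class and method names (TrueColorSketch, pack, colorCount) are hypothetical, and the 0xRRGGBB channel order is an assumption chosen for illustration.

```java
final class TrueColorSketch {

    /** Packs three 8-bit channels into one 24-bit integer laid out as 0xRRGGBB (assumed order). */
    static int pack(int r, int g, int b) {
        return ((r & 0xFF) << 16) | ((g & 0xFF) << 8) | (b & 0xFF);
    }

    /** Extracts the 8-bit channels again. */
    static int red(int rgb)   { return (rgb >> 16) & 0xFF; }
    static int green(int rgb) { return (rgb >> 8) & 0xFF; }
    static int blue(int rgb)  { return rgb & 0xFF; }

    /** Number of representable colors with three channels of the given bit depth. */
    static int colorCount(int bitsPerChannel) {
        return 1 << (3 * bitsPerChannel);
    }

    public static void main(String[] args) {
        System.out.println(colorCount(8));                          // 16777216
        System.out.println(Integer.toHexString(pack(255, 128, 0))); // ff8000
    }
}
```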

Color quantization is a preprocessing technique that reduces the number of colors
in an image such that the reconstructed image remains visually close to the original image. It is able
to eliminate unnecessary information from images. For instance, unnecessary color information in
topographic maps needs to be eliminated so as to accurately construct a digital elevation model
(DEM). Color quantization plays a critical role in many other digital applications such as segmentation,
color texture analysis, content-based retrieval, watermarking, text detection, and non-photorealistic
rendering. In general, this technique is performed in two steps. The first is palette design, which selects
the palette colors from the colors in the original image. The second is pixel mapping, which replaces
each color with a color in the palette. Color quantization is a form of lossy image
compression [7-10]. Figure 1 compares a 24-bit original image with its quantized version
in 64 colors.

Currently, there are two classes of algorithms for palette design: clustering-based and splitting-based. Some
well-known splitting algorithms are median-cut [11], center-cut [12], Octree [13], variance-based [14],
RWM-cut [15], and binary splitting [16]. Splitting-based algorithms generally split the color space
of the color image into two disjoint groups according to their splitting criteria. Then, the splitting is iterated
until the desired number of groups is reached. Finally, the cluster center of each group is chosen
as the palette color [17]. Clustering-based algorithms include Hierarchical Clustering, K-Means,
and K-Medoids [18]. The performance of clustering-based techniques varies greatly depending on
how the K representative colors are chosen. These techniques have to make a tradeoff between
computational efficiency and minimization of the distortion measure. For example, the standard K-Means
algorithm can minimize the quantization error efficiently if a sufficient number of iterations is allowed [19, 20].
The K-Means clustering algorithm [21] was originally proposed for pattern recognition. It is commonly
recognized as a sub-optimal quantization technique that can also be applied to color vision,
image segmentation, and vector quantization. For image vector quantization, the generalized Lloyd
algorithm, also called the LBG algorithm, is identical to the standard K-Means algorithm [22].

K-Means++ was proposed by David Arthur and Sergei Vassilvitskii in 2007. It improves on
the standard K-Means by replacing the random selection of the initial cluster centers. The algorithm
uses squared-distance weighting to select each subsequent center. It also reduces
the fluctuation that occurs in K-Means and provides better and more stable clustering results [23]. The authors of [24]
found the K-Means++ algorithm to be more accurate than the standard K-Means in crime
document clustering. Therefore, we investigate the PSNR and file size of quantized images using
K-Means++ compared to the standard K-Means.







Figure 1. Comparison between (a) an original image in 24-bit RGB color and
(b) a quantized image reduced to a palette of 64 colors


The rest of the paper is organized as follows. Section 2 describes the research method, including
the K-Means++ algorithm and PSNR as the quality assessment method. Section 3 presents the results
and analysis. Finally, Section 4 presents the conclusions and future work.


2. RESEARCH METHOD 

In this experiment, the image is converted to a collection of RGB color values, X. The number
of clusters (K value) desired to represent the image is then specified. The numbers of clusters used
are 32, 64, 128, and 256. The number of clusters must be less than the actual number of colors that exist in
the image. The next step is to choose K colors to represent the original image. The chosen colors act as
the centroids used to classify all the colors in the original image. The K-Means++
clustering algorithm then proceeds as follows [25]:
Step 1: Select a center $c_1$ uniformly at random from X.

Step 2: Select a new center $c_i$ by choosing $x \in X$ with probability

$$\frac{D(x)^2}{\sum_{x \in X} D(x)^2}$$

where D(x) denotes the shortest distance from a data point to the closest center already chosen:

$$D(x)^2 = (x_{i1} - c_{j1})^2 + (x_{i2} - c_{j2})^2 + \dots + (x_{in} - c_{jn})^2 \qquad (1)$$

Step 3: Repeat Step 2 until k centers have been chosen altogether.

Step 4: For each $i \in \{1, \dots, k\}$, set the cluster $C_i$ to be the set of points in X that are closer to $c_i$ than they are to $c_j$
for all $j \neq i$.

Step 5: For each $i \in \{1, \dots, k\}$, set $c_i$ to be the center of mass of all points in $C_i$:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

Step 6: Repeat Steps 4 and 5 until C no longer changes.
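
The steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the class and method names are hypothetical, colors are represented as double arrays, the maxIterations cap is an added safety limit, and an empty cluster simply keeps its previous center.

```java
import java.util.Arrays;
import java.util.Random;

final class KMeansPlusPlusSketch {

    /** Squared Euclidean distance, as in equation (1). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the closest center among the first `count` centers. */
    static int nearest(double[] x, double[][] centers, int count) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < count; j++) {
            double dist = squaredDistance(x, centers[j]);
            if (dist < bestDist) { bestDist = dist; best = j; }
        }
        return best;
    }

    /** Initial centers: first uniformly at random, the rest drawn with D(x)^2 weighting. */
    static double[][] seed(double[][] points, int k, Random rng) {
        double[][] centers = new double[k][];
        centers[0] = points[rng.nextInt(points.length)].clone();
        double[] d2 = new double[points.length];
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = squaredDistance(points[i], centers[nearest(points[i], centers, c)]);
                total += d2[i];
            }
            // Draw index i with probability d2[i] / total via a running sum.
            double target = rng.nextDouble() * total;
            double running = 0.0;
            int chosen = points.length - 1;
            for (int i = 0; i < points.length; i++) {
                running += d2[i];
                if (running > target) { chosen = i; break; }
            }
            centers[c] = points[chosen].clone();
        }
        return centers;
    }

    /** Assignment and update loop; stops when no assignment changes or the cap is hit. */
    static double[][] cluster(double[][] points, int k, Random rng, int maxIterations) {
        double[][] centers = seed(points, k, rng);
        int dim = points[0].length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int j = nearest(points[i], centers, k);
                if (j != assignment[i]) { assignment[i] = j; changed = true; }
            }
            if (!changed) break;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] == 0) continue; // assumption: empty cluster keeps its center
                for (int d = 0; d < dim; d++) centers[j][d] = sums[j][d] / counts[j];
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] colors = {{0, 0, 0}, {1, 1, 1}, {0, 1, 0}, {200, 200, 200}, {201, 199, 200}};
        for (double[] c : cluster(colors, 2, new Random(42), 100)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

In this sketch the seeding cost grows with the number of points times K, so for a 512x512 image it may be worth clustering the distinct colors rather than every pixel; that choice is left open here.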

After the algorithm is done, each color is visited in turn and replaced with the center color
closest to it. This step is known as color remapping in color quantization.
The remapping process reproduces an image that is visually similar to the original image but
contains only K colors.
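
The remapping step might look like the following sketch, which assumes pixels and palette entries are packed as 0xRRGGBB integers and uses squared Euclidean distance in RGB. The names ColorRemapSketch, nearestPaletteIndex, and remap are hypothetical.

```java
final class ColorRemapSketch {

    /** Index of the palette entry closest to the given packed RGB color. */
    static int nearestPaletteIndex(int rgb, int[] palette) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        int best = 0;
        int bestDist = Integer.MAX_VALUE;
        for (int j = 0; j < palette.length; j++) {
            int dr = r - ((palette[j] >> 16) & 0xFF);
            int dg = g - ((palette[j] >> 8) & 0xFF);
            int db = b - (palette[j] & 0xFF);
            int dist = dr * dr + dg * dg + db * db; // at most 3 * 255^2, fits in an int
            if (dist < bestDist) { bestDist = dist; best = j; }
        }
        return best;
    }

    /** Returns a copy of the pixels in which every color is replaced by its nearest palette color. */
    static int[] remap(int[] pixels, int[] palette) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = palette[nearestPaletteIndex(pixels[i], palette)];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] palette = {0x000000, 0xFFFFFF, 0xFF0000};
        int[] pixels = {0x101010, 0xF0F0F0, 0xE01010};
        for (int p : remap(pixels, palette)) {
            System.out.println(Integer.toHexString(p));
        }
    }
}
```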

The effectiveness of color quantization algorithms is measured by PSNR (Peak Signal-to-Noise Ratio),
where a higher PSNR means the quantized image is visually closer to the original [26].


$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{M^2}{\mathrm{MSE}} \right) \qquad (2)$$


where M is the maximum value of the original image. Typical PSNR values for lossy compressed images
are between 30 and 50 dB [27]. MSE measures the distortion or deviation between the original image and its
reconstructed (quantized) image [28].


$$\mathrm{MSE} = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} (S_{xy} - C_{xy})^2 \qquad (3)$$
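
As a rough illustration of how the two quality measures fit together, the sketch below computes MSE over a single-channel image stored as a 2D array and then derives PSNR from it. The class name PsnrSketch, the single-channel representation, and the choice to return infinity for identical images are assumptions made for this example; for RGB images one common option is to average the squared error over the three channels.

```java
final class PsnrSketch {

    /** Mean squared error between two images of identical dimensions. */
    static double mse(int[][] original, int[][] quantized) {
        double sum = 0.0;
        int rows = original.length;
        int cols = original[0].length;
        for (int x = 0; x < rows; x++) {
            for (int y = 0; y < cols; y++) {
                double diff = original[x][y] - quantized[x][y];
                sum += diff * diff;
            }
        }
        return sum / (rows * cols);
    }

    /** PSNR in dB for a given maximum pixel value (255 for 8-bit channels). */
    static double psnr(int[][] original, int[][] quantized, double maxValue) {
        double error = mse(original, quantized);
        if (error == 0.0) {
            return Double.POSITIVE_INFINITY; // identical images: no distortion
        }
        return 10.0 * Math.log10(maxValue * maxValue / error);
    }

    public static void main(String[] args) {
        int[][] original = {{0, 255}, {100, 50}};
        int[][] quantized = {{0, 250}, {110, 50}};
        System.out.println(mse(original, quantized));          // 31.25
        System.out.println(psnr(original, quantized, 255.0));  // about 33.18 dB
    }
}
```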


3. RESULTS AND DISCUSSION 

To evaluate the PSNR value and file size of the K-Means++ clustering algorithm, four popular
images from the USC-SIPI Image Database (http://sipi.usc.edu/database/) are used: “Baboon”, “Fruits”,
“Lena”, and “Peppers”. The images are 512x512 pixels in size. The original image file sizes are
624 KB, 464 KB, 464 KB, and 528 KB, respectively. Figure 2 shows the test images.

Figure 2. The test images, (a) Baboon, (b) Fruits, (c) Lena, (d) Peppers


Experiments are implemented in the Java programming language on an i7-7700HQ 2.8 GHz
processor with 8 GB RAM and an NVIDIA GeForce GTX 1050 Ti 4 GB graphics card. K-Means++ is compared to
the standard K-Means clustering algorithm. The tests were run three times for each K value. Table 1 reports
the average PSNR values and file sizes of the test images. Figure 3 shows the quantization results
for “Fruits”. Figure 4 shows the relationship between K value and PSNR, while Figure 5 shows
the relationship between K value and file size.


Table 1. The average PSNR and file sizes

                                                K-Means++             K-Means
No.  Image    Number of colors  Size (KB)  K    PSNR (dB)  Size (KB)  PSNR (dB)  Size (KB)
1    Baboon   230427            622        32   27.07      129        26.31      130.67
                                           64   28.95      169.67     28.39      170
                                           128  30.89      215        29.93      218
                                           256  32.67      273.33     30.80      21333
2    Fruits   49451             461        32   30.25      61.67      29.90      62
                                           64   32.85      82         32.53      83.33
                                           128  329        110.33     34.14      110.67
                                           256  37.06      137.67     36.86      138.67
3    Lena     148279            462        32   32.08      93.67      31.00      95
                                           64   33.81      129.67     32.26      130
                                           128  34.88      161        32.68      163.33
                                           256  35.08      175.67     35.05      176
4    Peppers  183525            526        32   28.56      82.67      28.15      84
                                           64   30.95      117.67     30.06      120
                                           128  33.40      159.33     32.34      163.33
                                           256  34.90      207        34.87      208.67




Figure 3. Quantized images of “Fruits”, (a) K=32, (b) K=64, (c) K=128, (d) K=256 

Based on the results in Table 1, the images quantized with K-Means++ all have an average PSNR
above 30 dB for K values of 128 and 256. The smallest average file size is 61.67 KB for the Fruits
image with a K value of 32, and the largest is 273.33 KB for the Baboon image with a K value of 256. The quantized
files are about 2 to 7 times smaller than the originals. K-Means++ clustering gives better PSNR
and file sizes than K-Means clustering: the PSNR gains and file size reductions for 32, 64, 128,
and 256 clusters are 2.19% and 1.2%, 2.59% and 1%, 3.93% and 1.39%, and 1.61% and 0.61%, respectively. The standard
K-Means algorithm initializes the cluster centers uniformly at random, whereas K-Means++ selects each
subsequent center with a probability weighted by its squared distance from the centers already chosen, so
the initial centers tend to be spread over the data. With purely random initialization, the results of
the standard K-Means differ from run to run, and the colors chosen as the centers may have adjacent RGB values. As a result, the PSNR
and file sizes obtained by the standard K-Means are worse than those of K-Means++. Based on Figures 4 and 5,
a larger number of clusters (K) yields a higher PSNR and a larger file size, while the file sizes remain much lower than
the original size. As the number of clusters grows, more pixels lie close to
a cluster center, so the quantized image is closer to the original.
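
The text does not state how the percentages were computed. One reading that reproduces the reported 2.19% PSNR figure for K = 32 from the Table 1 values is the per-image difference relative to the K-Means++ value, averaged over the four images. The sketch below illustrates that reading only; the formula, the choice of denominator, and the name meanRelativeGainPercent are assumptions.

```java
final class RelativeGainSketch {

    /**
     * Mean over images of (kmpp[i] - km[i]) / kmpp[i], expressed in percent.
     * Assumed formula: the choice of kmpp[i] as denominator is a guess that
     * happens to match the reported PSNR figure for K = 32.
     */
    static double meanRelativeGainPercent(double[] kmpp, double[] km) {
        double sum = 0.0;
        for (int i = 0; i < kmpp.length; i++) {
            sum += (kmpp[i] - km[i]) / kmpp[i];
        }
        return 100.0 * sum / kmpp.length;
    }

    public static void main(String[] args) {
        // PSNR (dB) at K = 32 for Baboon, Fruits, Lena, Peppers, taken from Table 1.
        double[] kMeansPlusPlus = {27.07, 30.25, 32.08, 28.56};
        double[] kMeans = {26.31, 29.90, 31.00, 28.15};
        // Roughly 2.19 under this reading.
        System.out.println(meanRelativeGainPercent(kMeansPlusPlus, kMeans));
    }
}
```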


4. CONCLUSION 

With the K-Means++ algorithm, the quantized images all have an average PSNR above
30 dB for K values of 128 and 256. The quantized files are about 2 to 7 times smaller than
the original size. The PSNR and file size increase as the K value gets bigger. In addition, the average PSNR
and file size of the quantized images are better than those of the original K-Means
algorithm, by 2.58% and 1.05% respectively. The K-Means++ algorithm is therefore found to be better and more accurate than
K-Means in color quantization. The reason is that the K-Means algorithm initializes the cluster
centers uniformly at random, whereas K-Means++ spreads the initial centers over the data by weighting
each candidate by its squared distance from the centers already chosen. For future work,
different distance metrics could be used to represent
the images so that they can be compared in terms of weighting. It is also recommended that
this work be extended to images with higher resolutions.